Increasing Software Reliability through Rollback and On-line Fault Repair
Abstract
In this paper, we propose a new paradigm for increasing the reliability of a software system by combining reactive and proactive approaches. The proposed approach employs rollback and restart for masking transient failures, and employs on-line software version change to remove faults from the software. A model for reliability analysis of a system employing the proposed approach is presented. The analysis shows that substantial benefit in reliability can be obtained by employing the proposed approach. A prototype system which incorporates the proposed approach is also described.
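The combination described in the abstract can be illustrated with a minimal sketch. This is not the paper's prototype: the function `run_with_recovery`, the `max_retries` parameter, the dictionary-based state, and the example versions `flaky`, `buggy`, and `fixed` are all hypothetical names chosen for illustration. The sketch assumes that a failure which survives a few rollback-and-retry attempts is not transient, and that a repaired version of the software is available to switch to.

```python
import copy


def run_with_recovery(versions, state, max_retries=2):
    """Try each software version in order (hypothetical helper).

    A failing attempt is rolled back to the checkpoint and retried, which
    masks transient failures. If the failure persists after max_retries
    retries, the next version in the list is used instead.
    """
    for version in versions:
        checkpoint = copy.deepcopy(state)  # snapshot taken before this version runs
        for _attempt in range(max_retries + 1):
            try:
                return version(state)
            except Exception:
                # Rollback: discard partial updates and restore the snapshot.
                state.clear()
                state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("all versions failed")


calls = {"n": 0}


def flaky(state):
    """Illustrative version that fails once, then succeeds on retry."""
    calls["n"] += 1
    state["partial"] = True
    if calls["n"] == 1:
        raise OSError("transient glitch")
    return "ok-v1"


def buggy(state):
    """Illustrative version with a deterministic fault; retries never help."""
    state["corrupt"] = True
    raise ValueError("deterministic bug")


def fixed(state):
    """Illustrative repaired version."""
    state["done"] = True
    return "ok-v2"
```

Calling `run_with_recovery([flaky], {"x": 1})` exercises the retry path, while `run_with_recovery([buggy, fixed], {"x": 1})` exercises the version switch. In both cases the partial updates made by a failed attempt are discarded before the next attempt starts.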
Similar resources
Empowering Software Debugging Through Architectural Support for Program Rollback
This paper proposes the use of processor support for program rollback, as a key primitive to enhance software debugging in production-run environments. We discuss how hardware support for program rollback can be used to characterize bugs on-the-fly, leverage code versioning for performance or reliability, sandbox device drivers, collect monitoring information with very low overhead, support fai...
Transient and Intermittent Fault Recovery without Rollback
Increasing chip density combined with heightened reliability expectations has spawned greater interest in fault tolerant design. In recent years, research into rollback and retry techniques has established them as an effective approach to recovery from transient and intermittent faults. For applications with strict timing requirements, however, the high error latency inherent in retry approaches...
Encore: Low-Cost, Fine-Grained Transient Fault Recovery
To meet an insatiable consumer demand for greater performance at less power, silicon technology has scaled to unprecedented dimensions. However, the pursuit of faster processors and longer battery life has come at the cost of device reliability. Given the rise of processor (un)reliability as a first-order design constraint, there has been a growing interest in low-cost, non-intrusive techniques...
System Reliability of Fault Tolerant Data Center Rev4
A single point of failure (SPOF) in system operations is a weak point of system reliability. The mean time to failure (MTTF) of system operations equals the MTTF of the shortest-lived component in the system. A Tier IV data center is designed to eliminate SPOFs. Data center system reliability depends not only on the MTTF of each component in the system, but also on the mean time to repair (MTTR...
Double phase fault location in microgrids with the presence of electric vehicles and Distributed parameters line model
Nowadays, renewable energy is increasingly used in smart grids and microgrids to reduce the use of fossil fuels and improve network efficiency. Like all power system devices, microgrids are subject to transient and steady-state faults, such as short circuits. These faults impair reliability and cause consumer dissatisfaction. To accurately, automatically, and economically determine the location of a ...
Publication date: 1997